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This paper presents, to the author's knowledge, the first graphics processing unit (GPU) accel- 
erated program that solves the evolution of interacting scalar fields in an expanding universe. We 
present the implementation in NVIDIA's Compute Unified Device Architecture (CUDA) and com- 
pare the performance to other similar programs in chaotic inflation models. We report speedups 
between one and two orders of magnitude depending on the used hardware and software while 
achieving small errors in single precision. Simulations that used to last roughly one day to compute 
can now be done in hours and this difference is expected to increase in the future. The program 
has been written in the spirit of LATTICEEASY and users of the aforementioned program should 
find it relatively easy to start using CUDAEASY in lattice simulations. The program is available 
at |http: / / www.physics.utu.fi/theory / pa rticlecosmology / cudaeasy / under the GNU General Public 
License. 



I. INTRODUCTION 

Recent developments in computer technology have made graphic processing units (GPUs) available to scientists as 
a powerful coprocessor in numerical computations. For example the theoretical peak performance of a state of the art 
AMD 5870 GPU is 2.7 tera floating point operations per second (TFLOPS) in single precision pQ compared to the 
peak performance 55.36 giga flops (GFLOPS) of Intel's Core i7 [2 . This difference in computational horsepower has 
made general-purpose computing on graphics processing units (GPGPU) an increasingly popular way to do numerical 

computations S U EE HI M M ED M ES El E3 ESI HZ1 HH [HI HH1 HH [21] - 

Due to the inherently parallel nature of computer graphics GPUs are well suited for problems that are easy to 
parallelize. In ideal situations a GPU version of a program can be up to three orders of magnitude faster [23] than 
the serial CPU version of the same program. Speedups of this magnitude can have dramatic effects on how science is 
made and make some previously computationally difficult problems rather trivial. 

The shear computational power of GPUs is however useless if if cannot be harnessed effectively. Therefore the 
use of computationally suitable language is of paramount importance. Recent developments in this sector have made 
programming of modern GPUs relatively easy. NVIDIA's CUDA architecture [M] and the OpenCL language [25] 
of the Khronos Group are similar to C syntax and therefore lower considerably the learning curve needed to write 
programs that use GPUs as co-processors. How GPUs are programmed in these languages is fairly similar and by 
learning one of these makes the other easy to learn as well. 

Cosmology has many computationally demanding problems [26, 27, 28 . The study of interacting fields and reheating 
after inflation in early universe is one of them. The nonlinear nature of the system compels to study the evolution 
of the system numerically. The problem then becomes the distretization of the scalar field equations and solving the 
evolution of the system in a lattice once the initial values have been set. This problem has been previously solved 
in LATTICEEASY [29] and DEFROST [3T] programs that use very different methods to solve the dynamics of the 
system. 

The lattice calculations are relatively easily parallelizable and therefore suitable for GPU computing. The aim of 
this paper is to present a GPU implementation of a program that solves the evolution of interacting scalar fields in 
an expanding universe. The main evolution algorithm is identical to the leap-frog method of LATTICEEASY but as 
was pointed out in [31 the initialization of LATTICEEASY is not entirely consistent. We have therefore adapted the 
initialization method from DEFROST and implemented this in the GPU. Our program can be therefore thought as 
an amalgam of the LATTICEEASY and DEFROST codes. 

We have chosen to write the program in the CUDA architecture mainly for pragmatig reasons: when we started to 
develop this code OpenCL was still in its infancy whereas CUDA was many generations old and had comprehensive 
documentation and a large group of developers. Note however that since OpenCL code can be run in both AMD 
and NVIDIA GPUs the future de facto GPU programming language is still undecided. In order to future proof the 
CUDAEASY program we are looking into porting it also into the OpenCL language. 
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This paper is organized as follows: in section II we will present the CUDA architecture, the equations of motion 
and our implementation of the lattice program. We will present numerical results in section III and we will conclude 
in section IV. 



II. GPU IMPLEMENTATION 



A. CUDA programming model 



NVIDIA introduced CUDA (an acronym for Compute Unified Device Architecture) in November 2006 as a pro- 
gramming language for NVIDIA GPUs |H] . CUDA comes with C for CUDA application programming interface that 
gives programmers familiar with C an easy way to start writing programs that are executed on GPU. This is acchieved 
by extending C with different CUDA commands. 

Within the CUDA framework GPUs are called devices whereas the CPU is known as the host. The fundamental 
concept in GPGPU are the lightweight threads that are executed in parallel on the device with kernel functions. 
Threads within a kernel constitute a two dimensional grid that is divided into different blocks that hold the threads. 
Blocks can be one, two or three dimensional and threads within the same block can share information through a 
shared memory. Threads and blocks have ids (threadldx and blockldx respectively) which can be used to calculate 
the global thread indexes within the grid. 

The device has a range of different memories that have different characteristics and roles in computing. Global or 
device memory is the main memory of the GPU but it is located off-chip and therefore has a considerable latency. 
Reads from the global memory should also be coalesced: threads within a block are divided into warps that are made 
of 32 threads. Threads within a half-warp (i.e. 16 threads) should read from a same segment of memory which size 
depends on the type of data. For example for floats this should be 128 bytes. If this requirement is not met there 
will be a performance penalty. The GPU also has a limited amount of fast shared memory that can be used to hide 
the latency of the global memory and to share data between threads within a block in order to save the memory 
bandwidth. The device also has registers and read-only texture and constant memories. A more accurate description 
of these can be found in the CUDA programming guide. We will be using a range of these memories in order to 
optimize the execution of the kernel. 

The device in NVIDIA GT200 series is divided into streaming multiprocessors (SM) that do the actual computations. 
Each SM has 8 single precision calculating units, 1 double precision unit and 2 super function units. Shared memory 
and registers are also located on the SMs. Once a kernel is launched on the device the resource usage of the kernel 
dictates how many blocks a SM can execute. Therefore it is crucial to keep the register and shared memory usage of 
a thread to its minimum whilst also minimizing the global memory fetches. The global reads should also be coalesced 
for the performance to be optimal which means that number of threads block should be a multiple of 16. The search 
for the optimal number of threads per block is however largely an exercise in trial and error. 



The dynamics of early universe are often modelled with classical scalar fields that are minimally coupled to gravity 
and interact with each other through the potential term. Starting from the action principle and by demanding Lorentz 
invariance the action is compelled to be a function of only the fields and their temporal and spatial first derivatives. 
Variation of this action with respect to the fields naturally leads to second order differential equations. In the case 

of reheating there arc N scalar fields i = 1. .... N which interact through potential V(4>\ <Pn) which leads to 

equations of motion 



where H = a/ a, A is the Laplacian with respect to comoving coordinates and dot denotes differentiation with respect 
to physical time. 

The spacetime has been assumed to be spatially homogeneous and isotropic i.e. a Friedmann-Robertson- Walker 
spacetime with scale factor a. The evolution of the scale factor is then determined by the Friedmann equations 



B. Equations of motion 



4>i + 3H<j>i 




(1) 



a 



AirGa 
3 



(2) 




2 



where the energy density and pressure are given by 

P = E + ^2 ( V( M 2 ) + Vfa) 

1 (3) 

p=E(^ 2 -^(w 2 )-^) 

and () denotes average over the volume of the lattice. Note that V is the gradient operator with respect to comoving 
coordinates. 

CUDAEASY uses a staggered leapfrog method to solve these equations of motion. This means that the field values 
and their derivatives are stored at different times and they are advanced in turns: 

(f>(t) = cj)(t-dt) + dt<fi(t-dt/2) 
(j>{t+dt/2) = <t>(t-dt/2) + dt<j>(t) (4) 
<P(t + dt) = cj){t)+dt(j>it + dt/2). 

For second order differential equations this method is however stable only if the corresponding differential equation 
has no first order time derivative terms, i.e. <f> — 0(</>). We will therefore move into units that will transform equation 
([lj into suitable form. This was presented in the LATT1CEEASY documentation [30] and we will only cite the results. 
The transformed variables read 



4> pr — Aa r (f), x pr = Bx, dt pr = Ba s dt (5) 

where the rescaling variables A, B, r and s can be set freely with the limitation that s — 2r + 3 = in order to 
eliminate the (f> pr term from the equation of motion. The scalar field equations then read in the new variables 
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where 



V P r = ^ S+2r V, (7) 

and V is also expressed in terms of the rescaled scalar fields <t> pr . 

There are now a range of different values these scaling parameters can take [30] while the limitation s — 2r + 3 = 
is met. For example by setting A — l/4>o where <^o is the value of the initially dominant scalar field will make the 
scalar fields of order of unity initially. If it is possible to identify a dominant potential term and to write it as 

V = |V (8) 

the recaling variables can be set to 

A-i, B-j*i»«>, and (9) 

The scale factor evalution in program units is determined by [30 



(_ s 2 )— + ^a-^-U £ \\V pr ^ pr f + a 2s+2 v) (10) 
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which we can now use in the leap frog method to evolve the scale factor. Note that this is done differently in 
DEFROST where the evolution of L = 1/H is solved first and the scale factor a evolution is solved from L. This 
has the advantage that the computationally complex gradient terms cancel out from the equations of motion. We 
chose to use the leap frog method mainly for pragmatic reasons: the lattice simulations are usually done with periodic 
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boundary conditions and when calculating the average over the volume of gradient terms we can use the divergence 
theorem to get 

J \V^\ 2 dV = J • V&dV = J • dS- J faAfadV (11) 

and the surface integral cancels out due to periodic boundary conditions. We can now use this result in the volume 
averages of the gradient terms in scale factor equation ( 10 ) and since the Laplacian is already calculated in the scalar 
field equations ([6]) the computation of the average gradient term actually has practically no computational effect. 
This turns out to be useful information in the GPU implementation. Note that this is valid only for the average 
gradient terms. When calculating the energy density and the pressure we still need to calculate the computationally 
difficult gradient terms. 

The initial values of the scalar fields are computed as presented in DEFROST [31 i.e. the homogeneous scalar field 
values are given as an input and the random quantum fluctuations are created by the program. The main differences 
come from the use of CUDA Fast Fourier Transform (CUFFT) instead of FFTW (an acronym for Fastest Fourier 
Transform in the West) and from the fact that we need to transform the scalar fields into program units that the leap 
frog uses i.e. 

4> pr = Aa r 4>, 

JJ A r-sl «\ ( 12 ) 
Ppr = S a <P + r — <p pr . 

After the initial values have been set we need to make the field values and their derivatives desynchronized by advacing 
the the scalar fields dt pr /2 steps forward. 



C. CUDA implementation 

As mentioned before GPUs are good at doing parallel computations whereas CPUs are much faster at serial 
calculations. We have therefore divided the computational labor between the GPU and the CPU based on how 
parallel different parts of the problem are. The scalar field equations ^ are the hardest part computationally but 
they can be parallelized easily since similar leap frog steps are taken at every point of the lattice with the same 
equations of motion. The evolution of the scale factor on the other hand is easy to evolve serially once the averages 
of the gradient and potential terms have been calculated. We will therefore solve the evolution of the scalar fields on 
the GPU and leave the scale factor evolution to the CPU. 

We will start by discretizing the equations of motion ^ and (10 1 in a cubic n 3 lattice in comoving coordinates 



with periodic boundary conditions. Since the computations on the GPU are done in single precision we have taken 
some additional steps to increase the numerical accuracy of the method. Compared to the naive discretization of 
the Laplacian used by LATTICEEASY we employ a better one presented by Patra and Karttunen in [31] that is 
second-order accurate and fourth-order isotropic. This more complex discretization uses all 26 neighbours of a 3x3 
cube compared to the six closest neighbours the naive discretization uses. This same stencil is also used in DEFROST 

EH- 

CUDA implementation of 3D stencils was presented in detail in 35J. We will follow a similar method that uses 
shared memory to calculate the Laplacians. Because of the limited amount of shared memory per SM (16 KB) the 3 
dimensional lattice has to be divided into smaller pieces that fit into the shared memory. To accomplish this we have 
sliced the lattice along the z-axis into smaller tiles and advance these tiles in ^-direction. The computations within 
these tiles are done by thread blocks. This means that a single thread of a thread block will advance all the scalar 
fields of a column with constant x- and y-coordinates. Since the outer threads of a thread block need the values of 
scalar fields that are in a different block and since different blocks can only communicate through global memory the 
role of the outer threads of a thread block is only to load data into shared memory and the scalar field computations 



are done by the inner threads (see figures 1(a) and 1(b) for illustration). This implementation unfortunately means 
that the reads are not fully coalesced and therefore additional speedups might be possible. One way would be to use 
texture memory to store the values of the scalar fields. This has not been implemented yet. 

An alternatice way to update the slices would be to do the calculations with all of the threads of a block and make 
some threads load more than one value into shared memory. We however noticed that this leads to thread branching 
within warps and numerous integer computations that slow the computations considerably and negate the speedup 
the additional threads would give. The overlapping of the threads in the current method however means that the 
thread indexes are not linear in the CUDA grid and extra care has to be taken to keep the globar memory reads 
consistent. We have done this with a device function that also takes into account the periodic boundary conditions. 
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threadld.x 
blockld.x 

(a) (b) 

FIG. 1: (a) Illustration of the CUDA grid with 16 thread blocks and 64 threads per block. The actual compute grid used in 
the program is much larger. The computations are done by the inner threads (green) of a block whereas the outer threads 
(light green) are used only to read data. Note that the thread blocks intersect each other, (b) Illustration of how data in the 
shared memory is updated. Thread blocks advance in z-direction and load data into the blue slice (shared_up) from the global 
memory. The previous values of blue slice are stored into the purple one (shared_middle) and the values from the middle slice 
are stored into the red block (shared_down). This way only a fraction of the memory bandwidth is needed. 



In order to reduce global memory usage every block uses three layers of shared memory: shared-down, shared_middlc 
and shared-up respectively. The tiles advance in z-direction and load new data into the shared-up memory from global 
memory. Before doing this update the previous values of shared ^middle are stored into shared_down and the values 
from shared-up are stored in the middle memory. This updating method will only need a fraction of the memory 
band with compared to a naive method where every thread would do 27 global memory loads for one calculation. 
Current implementation with 256 threads in a thread block needs roughly only w l/21th of this banwidth and 
therefore makes the kernel much faster. The computation starts from the bottom of the lattice i.e. z = and due to 
the periodicity of the lattice the initial values of shared-down are loaded from the top of the lattice. See fig 1(b) for 
an illustration of the update method. 

After the different shared memory slices have been updated the Laplacian can be calculated with the discretization 
mentioned before. Since the coefficients used in this dicretization are the same at every point of the lattice we have 
stored these in the constant memory of the GPU which is meant to store read-only data for the threads. This way 
we can lower the register need of a kernel and make the program faster. 

Once the Laplacian is calculated we can proceed to the actual leap frog step i.e. equations Q. The other terms 
beside the Laplacian depend only on the local values of the scalar fields and can therefore be stored in registers. 
Those terms in equation ([6| that either depend on the scale factor or are constants such as any factors in the 
potential derivatives are stored in the constant memory since these are constant throughout the lattice during a 
leap-frog step. This way we can save some of the registers but also avoid any divisions on the GPU which are much 
slower than additions or multiplications. Once the leap-frog steps have been taken the new values of the scalar field 
and its derivative are written into global memory and are used in the next step. 

Before advancing a thread block in z-direction we will do some additional calculations needed in the scale factor 
evolution: namely we need to calculate the averages of the squared gradient and the potential term in equation ( |10[ ). 
As explained previously the gradient term can be written in terms of the Laplacian and since this has been already 
calculated the gradient term is trivial to implement in the kernel. We simply make a new variable (called gpe for 
gradient and potential energy in the program) and subtract the values of (j) pr A pr (f)p r from it along the column the 
thread is advancing in z-direction. The potential terms are done similarly but they are written only for the last scalar 
field being evolved. We will employ additionally the Kahan summation [33 when calculating these sums in order to 
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keep the numerical errors as small as possible. 

Once the calculations needed in the averages are done the thread block advances in z-direction and loads new values 
into shared memory, calculates the Laplacian, advances the scalar fields and increments the gpe variable and advances 
again. Once the thread blocks reach the top of the lattice the evolution step finishes for the lattice and writes the 
values of gpe into global memory. This way the global memory bandwidth needed is reduced significantly compared 
to a method where the increment variables would be written at every point of the lattice since the values are now 
written only once by every thread. 

After finishing the evolution kernel for one scalar field the same calculations need to be done for the other fields as 
well. All of the fields could be in theory advanced with one kernel call but this would quickly lead to an extremely 
long kernel function suitable only for a very spesific model. Current implementation uses two fields but this can be 
easily increased for different more complex models. The scale factor evolution on the CPU is done once every scalar 
field has been evolved. 

We have also written a more complex kernel function that is used to calculate additionally the energy density and 
pressure of the scalar fields i.e. equations The discretization of these was presented in [31] and the implementation 
in CUDA is similar to the Laplacian method explained previously. Since this kernel does more calculations and uses 
more registers the main evolution is done with the simpler and faster kernel. How often the energy density and 
pressure need to be calculated is left to the user. These quantities are written by the program in SILO format into 
seperate files that can be easily visualized with the VISIT program provided by LLNL (see figures 3(a)j|3(d) I . 



III. NUMERICAL RESULTS 

In order to test the program and verify the output we have solved the evolution of scalar fields in the simple chaotic 
inflation model which was studied in [32] with LATTICEEASY and in [31] with DEFROST. The potential in this 
case is 

V{^) = ^mV + j^W, (13) 

where <f> is the inflaton and ip the decay product. We will use units similar to DEFROST i.e. we will set the reduced 
Planck mass m v \ = (SttG)^ 1 ^ 2 equal to one in the computations. The initial values have been set to coincide with 
the values given in [3T][32]. This means that homogenous value of the inflaton is set to cj) ~ 1.009343 and the decay 
field ip = 0, the mass of the inflaton equals 5 • 10 6 m p i, the Hubble parameter H ~ 0.50467m, and the value of the 
coupling constant g — 100. We have evolved the system with two different time steps {dt — 2~ 9 /m and dt = 2~ 10 /m) 
and lattice sizes (128 3 and 256 3 respectively). The rescaling parameters have been chosen to be 

1 3 

A = — , B — y/m, r=~ and s = 0, (14) 
0o 2 

which means that dt pr = dt. We have evolved this system with 2 18 time steps which corresponds to evolution times 
of t = 512/m and t — 256/m depending on the time step and lattice size. The larger lattice corresponds to the one 
used in DEFROST. 

The CUDA kernels are run on a NVIDIA GTX 260 SP 216 with 896 MB of global memory which can be bought at 
the time of writing for ~ 200 $ from a well equipped computer store. The host computations are run on a quadcore 
AMD Phenom processor with 4 GB of host memory. We chose to use thread blocks that contain 324 threads of which 
computations are done by 256 threads. This configuration was found to be the fastest of 64, 128 or 256 threads per 
block. The CUDA grid size is based on the size of lattice. For the 256 3 case we chose to use 256(= 16 2 ) thread blocks. 

As noted previously the computations are done with two kernels depending on if the energy density and pressure 
need to be calculated. The simpler of these does most of the calculations whereas the other kernel is cuttently called 
once in 512 steps. Because of the complicated gradient functions the kernels are under heavy register pressure. The 
main evolution kernel for example needs 24 registers per thread. Despite our best efforts we haven't yet been succesful 
in lowering this register pressure. If future generations of GPUs have more registers we expect more speedups from 
the program. _____ 

The numerical results are shown in figures 2(a)|3(d) Fig. 2(a) shows the evolution l/(a 3 ^ 2 H) as a function of 



the scale parameter with two different lattice sizes: the evolution is plotted in green for 128 3 lattice and in blue for 
256 3 lattice. Initially the systems evolve differently but as time progresses the the same evolution is restored. The 
difference is mainly due to different initializations. These figures show that the system produces the same evolution 
as the DEFROST program. In fig. |2(b)| we have plotted the absolute value of the relative residual curvature 
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(c) (d) 

FIG. 2: (a) The evolution l/(a 3/2 H) as a function of the scale parameter in 128 3 (green) and 256 3 (blue) lattices, (b) The 
absolute value of the relative residual curvature J^-i = (^njr^ — m the 128 3 (green) and 256 3 (blue) lattices. Once the 
parametric resonance starts the errors increase slightly but these are generally very small considering that the evolution is 
calculated in single precision, (c), (d) The evolution of the effective equation of state w = (P) / (p) in the lattice of (c) 128 3 and 
(d) 256 3 elements. 



as a function of time in the different lattices which we use as a proxy for numerical errors. The numerical errors 
clearly dilute as time progresses altough the parametric resonance silghtly increases it. Considering that the field 
equations are solved in single precision residual errors this small can be considered exceptional. Current version of 
the code could be easily made to run also in double precision but this would drop the performance to one eight of the 
current throughput due to the number of double precision units on the chip. The size of global memory would also 
limit the size of the lattice to be smaller in double precision. 

Besides the system evolution we have also plotted 3D visualizations of the evolution of the energy density in the 
256 3 lattice in figures |3(a)|3(d)| These fi gures were made with LLNL's Visit visualization program. The plots show 
the logarithm of energy density at t = 0, t = 64/m, i = 128 /m and t — 256/m respectively. The start of parametric 
resonance can be clearly seen in figure 3(c)| 

The total computation time per one step is roughly 0.022 seconds which corresponds to a total running time of 
one hour and 36 minutes for one simulation with 256 3 elements. The smaller lattice 128 3 with 2 18 steps is solved in 
less than 900 seconds which corresponds to less than 0.0034 seconds per one time step. Compared to the DEFROST 
program this is an order of magnitude faster (DEFROST reports a computation time of 0.32 seconds per time step 
on an Intel Xeon computer for the 256 3 lattice). Compared to LATTICEEASY this speedup increases by a factor of 
four |31j . What this means is that simulations that used to last roughly one day can now be computed in hours and 
in the future this speedup is expected to increase even more due to software improvements and due to the more rapid 
increase of computational power of GPUs compared to CPUs [2"2"] . 



7 




I 
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(b) 





(c) 



(d) 



FIG. 3: The evolution of energy density during simulation. Plots show logarithm of energy density at (a) t = 0, (b) t = 64/m, 
(c) t — 128/m and (d) t = 256/m. The start of parametric resonance can be clearly seen in figure [3(c)| After the resonance 
energy densities dilute to the state shown at 3(d) 



IV. DISCUSSION AND CONCLUSIONS 



We have presented a GPGPU implementation of a program that solves the evolution of interacting scalar fields in 
an expanding space. The main implementation has been done in the spirit of LATTICEEASY which means that users 
familiar with aforementioned program should find it relatively easy to start using CUDAEASY. We have implemented 
most of the improvements that were first presented in DEFROST including making the initializations of the program 
consistent and using more advanced discretizations of differential operators. Because of these and other improvements 
the program achieves roughly same precision as DEFROST while using single precision in the GPU calculations. 

We would like to note that we have not (at least yet) implemented all of the functionalities of LATTICEEASY and 
DEFROST: namely the current version of CUDAEASY doesn't yet produce any quantative data about the spectrum 
but this will be rather trivial to implement with the CUD A FFT as a post evolution procedure. This is however left 
to future versions of the program. 

Other improwements might include a multi-GPU version of CUDAEASY that could be used to do simulations in 
much larger lattices. Periodic boundary conditions and the initialization of the lattice might pose some problems but 
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these questions are beyond the scope of this paper. Graphics processing units also make it possible to visualize the 
evolution of the system in real time: one example are the N-body simulations presented in [22] • In CUDAEASY this 
could be applied to smaller lattices that are simulated in a few minutes. For larger (~ 256 3 ) lattices the real time 
visualization might however have a performance hindering effect. 

Another interesting venture would be the porting of CUDAEASY into OpenCL. At the time of writing there are 
however some obstacles before this can be done though. Despite our best efforst we haven't yet found a solid FFT 
implementation on OpenCL. Once this situation improves CUDAEASY should be relatively easy to port to OpenCL 
language and therefore harness the computational power of AMD GPUs also into preheating calculations. 

Since the aim of this paper was to present a GPU implementation of LATTICEEASY and DEFROST we have 
not done any original research in cosmology. This is the aim of a future paper where we will study non-perturbative 
curvaton decay with CUDAEASY. This has been previously studied in [33] where AN formalism was used to calculate 
the non-gaussianity [35J [37] [35] generated by the resonant curvaton decay. We believe that with CUDAEASY the 
needed simulations can now be run much faster without sacrificing the accuracy of the results. 

Besides solving preheating evolution CUDAEASY can also be used after slight modifications to study the evolution 
of Q-balls [40 , gravity waves, phase transitions, black hole production and many more interesting phenomena in 
early universe |41j . With the computational power of GPUs many of these problems can now be studied much faster 
than previously even on a common desktop computer without losing accuracy. We hope that CUDAEASY is able to 
achieve this ambitious goal and be useful to the interested user. 
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